–Data wrangling–
–For problem 1–
–For problem 2–
# load packages
library(tidyverse)
library(corrplot)
library(lubridate)
library(broom)
library(reshape2)
# load and check dataset
wq <- read_csv("BayDeltaWQ.csv",
col_names = TRUE,
na = c("NA", "n/p", "n/a"),
guess_max = 30000)
|=== | 3%
|==== | 3%
|==== | 4%
|==== | 4%
|===== | 4%
|===== | 5%
|===== | 5%
|====== | 5%
|====== | 5%
|====== | 6%
|======= | 6%
|======= | 6%
|======== | 7%
|======== | 7%
|======== | 7%
|========= | 8%
|========= | 8%
|========= | 8%
|========== | 9%
|========== | 9%
|========== | 9%
|=========== | 10%
|========== | 10% 1 MB
|========== | 10% 1 MB
|=========== | 10% 1 MB
|=========== | 11% 1 MB
|=========== | 11% 1 MB
|============ | 11% 1 MB
|============ | 12% 1 MB
|============ | 12% 1 MB
|============= | 12% 1 MB
|============= | 13% 1 MB
|============= | 13% 1 MB
|============== | 13% 1 MB
|============== | 14% 1 MB
|============== | 14% 1 MB
|=============== | 14% 1 MB
|=============== | 15% 1 MB
|=============== | 15% 1 MB
|================ | 15% 1 MB
|================ | 16% 1 MB
|================ | 16% 1 MB
|================= | 16% 1 MB
|================= | 16% 1 MB
|================= | 17% 1 MB
|================== | 17% 1 MB
|================== | 17% 1 MB
|================== | 18% 1 MB
|=================== | 18% 1 MB
|=================== | 18% 1 MB
|=================== | 19% 1 MB
|==================== | 19% 1 MB
|==================== | 19% 1 MB
|==================== | 20% 1 MB
|===================== | 20% 2 MB
|===================== | 20% 2 MB
|===================== | 21% 2 MB
|====================== | 21% 2 MB
|====================== | 21% 2 MB
|====================== | 22% 2 MB
|======================= | 22% 2 MB
|======================= | 22% 2 MB
|======================= | 22% 2 MB
|======================= | 23% 2 MB
|======================== | 23% 2 MB
|======================== | 23% 2 MB
|======================== | 24% 2 MB
|========================= | 24% 2 MB
|========================= | 24% 2 MB
|========================= | 25% 2 MB
|========================== | 25% 2 MB
|========================== | 25% 2 MB
|========================== | 26% 2 MB
|=========================== | 26% 2 MB
|=========================== | 26% 2 MB
|=========================== | 27% 2 MB
|============================ | 27% 2 MB
|============================ | 27% 2 MB
|============================ | 28% 2 MB
|============================= | 28% 2 MB
|============================= | 28% 2 MB
|============================= | 28% 2 MB
|============================== | 29% 2 MB
|============================== | 29% 2 MB
|============================== | 29% 2 MB
|=============================== | 30% 2 MB
|=============================== | 30% 3 MB
|=============================== | 30% 3 MB
|================================ | 31% 3 MB
|================================ | 31% 3 MB
|================================ | 31% 3 MB
|================================= | 32% 3 MB
|================================= | 32% 3 MB
|================================= | 32% 3 MB
|================================== | 33% 3 MB
|================================== | 33% 3 MB
|================================== | 33% 3 MB
|=================================== | 34% 3 MB
|=================================== | 34% 3 MB
|=================================== | 34% 3 MB
|==================================== | 34% 3 MB
|==================================== | 35% 3 MB
|==================================== | 35% 3 MB
|===================================== | 35% 3 MB
|===================================== | 36% 3 MB
|===================================== | 36% 3 MB
|===================================== | 36% 3 MB
|====================================== | 37% 3 MB
|====================================== | 37% 3 MB
|====================================== | 37% 3 MB
|======================================= | 38% 3 MB
|======================================= | 38% 3 MB
|======================================= | 38% 3 MB
|======================================== | 39% 3 MB
|======================================== | 39% 3 MB
|======================================== | 39% 3 MB
|========================================= | 40% 3 MB
|========================================= | 40% 3 MB
|========================================= | 40% 4 MB
|========================================== | 41% 4 MB
|========================================== | 41% 4 MB
|========================================== | 41% 4 MB
|=========================================== | 42% 4 MB
|=========================================== | 42% 4 MB
|=========================================== | 42% 4 MB
|============================================ | 42% 4 MB
|============================================ | 43% 4 MB
|============================================ | 43% 4 MB
|============================================= | 43% 4 MB
|============================================= | 44% 4 MB
|============================================= | 44% 4 MB
|============================================== | 44% 4 MB
|============================================== | 45% 4 MB
|============================================== | 45% 4 MB
|=============================================== | 45% 4 MB
|=============================================== | 46% 4 MB
|=============================================== | 46% 4 MB
|================================================ | 46% 4 MB
|================================================ | 47% 4 MB
|================================================ | 47% 4 MB
|================================================= | 47% 4 MB
|================================================= | 48% 4 MB
|================================================= | 48% 4 MB
|================================================== | 48% 4 MB
|================================================== | 49% 4 MB
|================================================== | 49% 4 MB
|=================================================== | 49% 4 MB
|=================================================== | 50% 4 MB
|=================================================== | 50% 4 MB
|==================================================== | 50% 4 MB
|==================================================== | 50% 5 MB
|==================================================== | 51% 5 MB
|===================================================== | 51% 5 MB
|===================================================== | 51% 5 MB
|===================================================== | 52% 5 MB
|====================================================== | 52% 5 MB
|====================================================== | 52% 5 MB
|====================================================== | 53% 5 MB
|======================================================= | 53% 5 MB
|======================================================= | 53% 5 MB
|======================================================= | 54% 5 MB
|======================================================== | 54% 5 MB
|======================================================== | 54% 5 MB
|======================================================== | 55% 5 MB
|========================================================= | 55% 5 MB
|========================================================= | 55% 5 MB
|========================================================= | 56% 5 MB
|========================================================== | 56% 5 MB
|========================================================== | 56% 5 MB
|========================================================== | 57% 5 MB
|=========================================================== | 57% 5 MB
|=========================================================== | 57% 5 MB
|=========================================================== | 58% 5 MB
|============================================================ | 58% 5 MB
|============================================================ | 58% 5 MB
|============================================================ | 58% 5 MB
|============================================================= | 59% 5 MB
|============================================================= | 59% 5 MB
|============================================================= | 59% 5 MB
|============================================================== | 60% 5 MB
|============================================================== | 60% 5 MB
|============================================================== | 60% 6 MB
|=============================================================== | 61% 6 MB
|=============================================================== | 61% 6 MB
|=============================================================== | 61% 6 MB
|================================================================ | 62% 6 MB
|================================================================ | 62% 6 MB
|================================================================ | 62% 6 MB
|================================================================= | 63% 6 MB
|================================================================= | 63% 6 MB
|================================================================= | 63% 6 MB
|================================================================== | 64% 6 MB
|================================================================== | 64% 6 MB
|================================================================== | 64% 6 MB
|=================================================================== | 65% 6 MB
|=================================================================== | 65% 6 MB
|=================================================================== | 65% 6 MB
|==================================================================== | 66% 6 MB
|==================================================================== | 66% 6 MB
|==================================================================== | 66% 6 MB
|===================================================================== | 67% 6 MB
|===================================================================== | 67% 6 MB
|===================================================================== | 67% 6 MB
|====================================================================== | 67% 6 MB
|====================================================================== | 68% 6 MB
|====================================================================== | 68% 6 MB
|====================================================================== | 68% 6 MB
|======================================================================= | 69% 6 MB
|======================================================================= | 69% 6 MB
|======================================================================= | 69% 6 MB
|======================================================================== | 70% 6 MB
|======================================================================== | 70% 6 MB
|======================================================================== | 70% 6 MB
|========================================================================= | 71% 7 MB
|========================================================================= | 71% 7 MB
|========================================================================= | 71% 7 MB
|========================================================================== | 72% 7 MB
|========================================================================== | 72% 7 MB
|========================================================================== | 72% 7 MB
|=========================================================================== | 73% 7 MB
|=========================================================================== | 73% 7 MB
|=========================================================================== | 73% 7 MB
|============================================================================ | 74% 7 MB
|============================================================================ | 74% 7 MB
|============================================================================ | 74% 7 MB
|============================================================================= | 75% 7 MB
|============================================================================= | 75% 7 MB
|============================================================================= | 75% 7 MB
|============================================================================== | 75% 7 MB
|============================================================================== | 76% 7 MB
|============================================================================== | 76% 7 MB
|=============================================================================== | 76% 7 MB
|=============================================================================== | 77% 7 MB
|=============================================================================== | 77% 7 MB
|================================================================================ | 77% 7 MB
|================================================================================ | 78% 7 MB
|================================================================================ | 78% 7 MB
|================================================================================= | 78% 7 MB
|================================================================================= | 79% 7 MB
|================================================================================= | 79% 7 MB
|================================================================================== | 79% 7 MB
|================================================================================== | 80% 7 MB
|================================================================================== | 80% 7 MB
|=================================================================================== | 80% 7 MB
|=================================================================================== | 81% 7 MB
|=================================================================================== | 81% 8 MB
|==================================================================================== | 81% 8 MB
|==================================================================================== | 82% 8 MB
|==================================================================================== | 82% 8 MB
|===================================================================================== | 82% 8 MB
|===================================================================================== | 83% 8 MB
|===================================================================================== | 83% 8 MB
|====================================================================================== | 83% 8 MB
|====================================================================================== | 83% 8 MB
|====================================================================================== | 84% 8 MB
|======================================================================================= | 84% 8 MB
|======================================================================================= | 84% 8 MB
|======================================================================================= | 85% 8 MB
|======================================================================================== | 85% 8 MB
|======================================================================================== | 85% 8 MB
|======================================================================================== | 86% 8 MB
|========================================================================================= | 86% 8 MB
|========================================================================================= | 86% 8 MB
|========================================================================================= | 87% 8 MB
|========================================================================================== | 87% 8 MB
|========================================================================================== | 87% 8 MB
|========================================================================================== | 88% 8 MB
|=========================================================================================== | 88% 8 MB
|=========================================================================================== | 88% 8 MB
|=========================================================================================== | 89% 8 MB
|============================================================================================ | 89% 8 MB
|============================================================================================ | 89% 8 MB
|============================================================================================ | 89% 8 MB
|============================================================================================= | 90% 8 MB
|============================================================================================= | 90% 8 MB
|============================================================================================= | 90% 8 MB
|============================================================================================== | 91% 8 MB
|============================================================================================== | 91% 9 MB
|============================================================================================== | 91% 9 MB
|============================================================================================== | 92% 9 MB
|=============================================================================================== | 92% 9 MB
|=============================================================================================== | 92% 9 MB
|=============================================================================================== | 93% 9 MB
|================================================================================================ | 93% 9 MB
|================================================================================================ | 93% 9 MB
|================================================================================================ | 94% 9 MB
|================================================================================================= | 94% 9 MB
|================================================================================================= | 94% 9 MB
|================================================================================================= | 95% 9 MB
|================================================================================================== | 95% 9 MB
|================================================================================================== | 95% 9 MB
|================================================================================================== | 96% 9 MB
|=================================================================================================== | 96% 9 MB
|=================================================================================================== | 96% 9 MB
|=================================================================================================== | 96% 9 MB
|==================================================================================================== | 97% 9 MB
|==================================================================================================== | 97% 9 MB
|==================================================================================================== | 97% 9 MB
|===================================================================================================== | 98% 9 MB
|===================================================================================================== | 98% 9 MB
|===================================================================================================== | 98% 9 MB
|======================================================================================================| 99% 9 MB
|======================================================================================================| 99% 9 MB
|======================================================================================================| 99% 9 MB
|=======================================================================================================| 100% 9 MB
wq
# pull out month and year from SampleDate--
# to allow for grouping by year and month
wq_bymonth <- wq %>%
mutate(Year = lubridate::year(SampleDate)) %>% # add Year column
mutate(Month = lubridate::month(SampleDate)) %>% # add Month column
group_by(Year, Month) %>% # # group rows by year and month
select(Month, everything()) %>% # move Month column as 1st column
select(Year, everything()) # move Year column as 1st column
wq_bymonth
# filter rows to select only those sampled from--
# October 2004 to September 2012 (water years 2005-2012)
wq_bymonth <- wq_bymonth %>%
#group_by(Year, Month) %>% # group rows by year and month
filter(Year > 2003) %>% # select observations after 2003
filter(!(Year==2004 && Month<10)) %>% # take out January-September 2004
filter(!(Year==2012 && Month>9)) # take out October-December 2012
wq_bymonth
# summarize the column values to the mean monthly--
# so that each month in each year has a single value--
# which is the mean of all values in that month.
# this will result in 96 rows where each row is the--
# mean value for that month in that particular year--
# (12 months x 8 years = 96 months)
wq_meanmonthly <- wq_bymonth %>%
summarise_if(is.numeric, mean, na.rm = TRUE) %>%
# take the mean values of each month of--
# each year only if the column type is a numeric
select(-X1) %>%
# take out index column (not useful)
select(`Chlorophyll a`, everything())
# move Chl-a column as first column
wq_meanmonthly
This wq_meanmonthly dataset is what will be used in preparing datasets for problems 1 and 2.
# remove NaN values
wq_meanmonthly_mod <- wq_meanmonthly %>%
select_if(~sum(!is.na(.)) > 0) %>%
# remove columns with values that are all NaN
select_if(~sum(!is.na(.)) == nrow(wq_meanmonthly))
# remove columns which have NaN values
wq_meanmonthly_mod
# remove variables that are not water quality variables
wq_meanmonthly_mod <- wq_meanmonthly_mod %>%
ungroup() %>%
select(-Year, -Month, -Depth)
wq_meanmonthly_mod
Now I will use the wq_meanmonthly_mod dataset for the model selection.
# use stepwise to reduce the number of variables
wq_step <- step(lm(`Chlorophyll a` ~ ., data = wq_meanmonthly_mod), trace = 0)
# stepwise regression; trace = 0 returns only the final model of the stepwise
wq_step
Call:
lm(formula = `Chlorophyll a` ~ `Conductance (EC)` + Oxygen +
Temperature + `Ammonia (Dissolved)` + `Kjeldahl Nitrogen (Total)` +
`Organic Nitrogen (Dissolved)` + `Pheophytin a` + `Solids (Total Dissolved)`,
data = wq_meanmonthly_mod)
Coefficients:
(Intercept) `Conductance (EC)` Oxygen
-2.818e+01 -3.190e-04 2.288e+00
Temperature `Ammonia (Dissolved)` `Kjeldahl Nitrogen (Total)`
5.800e-01 -8.995e+00 6.765e+00
`Organic Nitrogen (Dissolved)` `Pheophytin a` `Solids (Total Dissolved)`
-4.271e+00 8.332e-01 5.821e-04
# the stepwise regression resulted in a model with--
# 8 predictor variables instead of the initial 19 variables
# make a new dataset containing only the 8 selected variables
wq_model <- select(wq_meanmonthly_mod,
`Chlorophyll a`,
`Conductance (EC)`,
Oxygen,
Temperature,
`Ammonia (Dissolved)`,
`Kjeldahl Nitrogen (Total)`,
`Organic Nitrogen (Dissolved)`,
`Pheophytin a`,
`Solids (Total Dissolved)`)
wq_model
# let's check if the model with 8 predictor variables is actually--
# better than with 19 predictor variables
lm_19v <- lm(`Chlorophyll a` ~ ., data = wq_meanmonthly_mod)
lm_8v <- lm(`Chlorophyll a` ~ ., data = wq_model)
lms <- list(w19v = lm_19v, w8v = lm_8v)
lms.stats <- mapply(glance, lms)
colnames(lms.stats) <- names(lms)
lms.stats
w19v w8v
r.squared 0.6170207 0.5929971
adj.r.squared 0.5212759 0.5555716
sigma 1.594462 1.536287
statistic 6.444429 15.84471
p.value 1.66057e-09 3.569059e-14
df 20 9
logLik -169.7921 -172.7124
AIC 381.5842 365.4247
BIC 435.4355 391.0682
deviance 193.2155 205.3356
df.residual 76 87
Although maybe not by much, the model from the stepwise regression (with 8 predictor variables) has a higher adj-r-square and test statistic with lower p-value, AIC, and BIC. From these values, the model with 8 predictor variables do seem to be better.
Having 8 predictor variables still sounds like a lot of variables. Let’s see if variable reduction can still be done by checking potential correlations between the predictor variables.
# --- (1) let's visualize the relationships between Chl-a and the 8 variables
plot(wq_model, pch=16, col="blue", cex = 0.5, main="Model: Chl-a ~ 8 variables")
From the plot, it looks like the following variables are potentially highly correlated:
# --- (2) let's check with correlation plot
mycor <- cor(wq_model)
cex.before <- par("cex")
par(cex = 0.7)
corrplot(mycor, method = "number", tl.cex = 1/par("cex"),
cl.cex = 1/par("cex"))
par(cex = cex.before)
The correlation plot verifies that the following variables are highly correlated:
Should some variables be removed? But which ones?
# --- (3) check regression summary to determine which variable to remove
chla_all <- lm(`Chlorophyll a` ~ ., data = wq_model)
summary(chla_all)
Call:
lm(formula = `Chlorophyll a` ~ ., data = wq_model)
Residuals:
Min 1Q Median 3Q Max
-2.5837 -0.7604 -0.1729 0.5220 8.1688
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.818e+01 7.968e+00 -3.536 0.000653 ***
`Conductance (EC)` -3.190e-04 1.838e-04 -1.736 0.086158 .
Oxygen 2.288e+00 6.500e-01 3.519 0.000690 ***
Temperature 5.800e-01 1.417e-01 4.092 9.52e-05 ***
`Ammonia (Dissolved)` -8.995e+00 5.672e+00 -1.586 0.116359
`Kjeldahl Nitrogen (Total)` 6.765e+00 2.741e+00 2.469 0.015522 *
`Organic Nitrogen (Dissolved)` -4.271e+00 2.764e+00 -1.545 0.125911
`Pheophytin a` 8.332e-01 2.971e-01 2.804 0.006219 **
`Solids (Total Dissolved)` 5.821e-04 3.324e-04 1.751 0.083415 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.536 on 87 degrees of freedom
Multiple R-squared: 0.593, Adjusted R-squared: 0.5556
F-statistic: 15.84 on 8 and 87 DF, p-value: 3.569e-14
Basing on the p-values and the variable correlations above:
Let’s see if additional reduction improves the model by comparing the 8-variable model and some other models derived from combinations of the variables.
# --- null
chla_null <- lm(`Chlorophyll a` ~ 1, data = wq_model)
# --- all 8 variables
chla_all <- lm(`Chlorophyll a` ~ ., data = wq_model)
# --- 4 variables: EC, Temp, Kj N, Pheo a
chla_ec.tmp.kjn.phe <- lm(`Chlorophyll a` ~
`Conductance (EC)` +
Temperature +
`Kjeldahl Nitrogen (Total)` +
`Pheophytin a`,
data = wq_model)
# --- 3 variables: EC, Temp, Pheo a
# EC, Ammonia, Pheo a
# Temp, Kj N, Pheo a
chla_ec.tmp.phe <- lm(`Chlorophyll a` ~
`Conductance (EC)` +
Temperature +
`Pheophytin a`,
data = wq_model)
chla_ec.amm.phe <- lm(`Chlorophyll a` ~
`Conductance (EC)` +
`Ammonia (Dissolved)` +
`Pheophytin a`,
data = wq_model)
chla_tmp.kjn.phe <- lm(`Chlorophyll a` ~
Temperature +
`Kjeldahl Nitrogen (Total)` +
`Pheophytin a`,
data = wq_model)
# --- 2 variables: Temp and Pheo a
# Ammonia and Pheo a
# Kj N and Pheo a
# Temp and Kj N
chla_tmp.phe <- lm(`Chlorophyll a` ~
Temperature +
`Pheophytin a`,
data = wq_model)
chla_amm.phe <- lm(`Chlorophyll a` ~
`Ammonia (Dissolved)` +
`Pheophytin a`,
data = wq_model)
chla_kjn.phe <- lm(`Chlorophyll a` ~
`Kjeldahl Nitrogen (Total)` +
`Pheophytin a`,
data = wq_model)
chla_tmp.kjn <- lm(`Chlorophyll a` ~
Temperature +
`Kjeldahl Nitrogen (Total)`,
data = wq_model)
lms_compare <- list(null=chla_null,
all=chla_all,
ec.tmp.kjn.phe=chla_ec.tmp.kjn.phe,
ec.tmp.phe=chla_ec.tmp.phe,
ec.amm.phe=chla_ec.amm.phe,
tmp.kjn.phe=chla_tmp.kjn.phe,
tmp.phe=chla_tmp.phe,
amm.phe=chla_amm.phe,
kjn.phe=chla_kjn.phe,
tmp.kjn=chla_tmp.kjn)
lms_compare.stats <- mapply(glance, lms_compare)
colnames(lms_compare.stats) <- names(lms_compare)
lms_compare.stats
null all ec.tmp.kjn.phe ec.tmp.phe ec.amm.phe tmp.kjn.phe tmp.phe amm.phe
r.squared 0 0.5929971 0.4806624 0.4790654 0.4728065 0.4574591 0.4570602 0.4728041
adj.r.squared 0 0.5555716 0.4578344 0.4620784 0.4556154 0.4397676 0.4453841 0.4614666
sigma 2.304473 1.536287 1.696827 1.690173 1.700296 1.724867 1.716199 1.691133
statistic NA 15.84471 21.0558 28.20188 27.50299 25.85749 39.14486 41.70251
p.value NA 3.569059e-14 2.583643e-12 5.064252e-13 8.71719e-13 3.212171e-12 4.633964e-13 1.1795e-13
df 1 9 5 4 4 4 3 3
logLik -215.8612 -172.7124 -184.4116 -184.559 -185.1322 -186.5096 -186.5449 -185.1325
AIC 435.7225 365.4247 380.8232 379.118 380.2645 383.0193 381.0898 378.2649
BIC 440.8512 391.0682 396.2093 391.9397 393.0862 395.841 391.3472 388.5223
deviance 504.5064 205.3356 262.0091 262.8148 265.9725 273.7153 273.9166 265.9737
df.residual 95 87 91 92 92 92 93 93
kjn.phe tmp.kjn
r.squared 0.3863068 0.2691238
adj.r.squared 0.3731091 0.253406
sigma 1.824599 1.991195
statistic 29.27076 17.12227
p.value 1.379546e-10 4.663771e-07
df 3 3
logLik -192.4248 -200.8127
AIC 392.8495 409.6254
BIC 403.1069 419.8828
deviance 309.6121 368.7317
df.residual 93 93
By comparing the models above, the model with all 8 variables actually had the highest adj-r-squared value compared to the models with less predictor variables. However, in the case that fewer variables are available for measurement, I would select the following based on comparison of values of the adj-r-square, AIC/BIC, p-value, and test statistic.
Chl-a ~ Ammonia + Pheophytin aThis model is able to explain 46% of the variation in Chl-a compared to 55% when using all 8 variables.
Let’s compare the residuals and predicted values for the model with 8 predictor variables and the model Chl-a ~ Ammonia + Pheophtin a.
# plot residuals
par(mfrow = c(2, 2))
plot(chla_all, pch = 16, which = 1)
plot(chla_all, pch = 16, which = 2)
plot(chla_amm.phe, pch = 16, which = 1)
plot(chla_amm.phe, pch = 16, which = 2)
par(mfrow = c(1, 1))
Top: all 8 variables; Bottom: Chl-a ~ Ammonia + Pheophytin a.
# plot predicted vs actual
par(mfrow = c(1, 2))
plot(predict(chla_all),wq_model$`Chlorophyll a`,
xlab="predicted",ylab="actual", main ="Chl-a ~ .")
abline(a=0,b=1)
plot(predict(chla_amm.phe),wq_model$`Chlorophyll a`,
xlab="predicted",ylab="actual", main ="Chl-a ~ Ammonia + Pheophytin a")
abline(a=0,b=1)
par(mfrow = c(1, 1))
The selected multiple regression model has 2 variables. Is ther one best predictor for Chl-a concentration? I look at 3 variables: Temperature, Ammonia, and Pheophytin a. These 3 variables have the highest correlation value to Chl-a based on the correlation plot.
# lm for each of the 3 variables
chla_tmp <- lm(`Chlorophyll a` ~ Temperature, data = wq_model)
chla_amm <- lm(`Chlorophyll a` ~ `Ammonia (Dissolved)`, data = wq_model)
chla_phe <- lm(`Chlorophyll a` ~ `Pheophytin a`, data = wq_model)
# compare lms of the 3 variables
lms_best <- list(null=chla_null,
all=chla_all,
tmp=chla_tmp,
amm=chla_amm,
phe=chla_phe)
lms_best.stats <- mapply(glance, lms_best)
colnames(lms_best.stats) <- names(lms_best)
lms_best.stats
null all tmp amm phe
r.squared 0 0.5929971 0.2595211 0.1718766 0.3558155
adj.r.squared 0 0.5555716 0.2516436 0.1630668 0.3489624
sigma 2.304473 1.536287 1.993544 2.108225 1.859407
statistic NA 15.84471 32.94487 19.50966 51.92092
p.value NA 3.569059e-14 1.15385e-07 2.675906e-05 1.426426e-10
df 1 9 2 2 2
logLik -215.8612 -172.7124 -201.4393 -206.8088 -194.7523
AIC 435.7225 365.4247 408.8785 419.6176 395.5046
BIC 440.8512 391.0682 416.5716 427.3106 403.1976
deviance 504.5064 205.3356 373.5763 417.7935 324.9952
df.residual 95 87 94 94 94
From the comparison, I would say that the most important variable explaining Chl-a is Pheophytin a which is able to explain 35% of the variability in Chl-a compared to 25% with Temperature and only 16% with Ammonia.
# plot residuals
par(mfrow=c(1,2))
plot(chla_phe, pch=16, which=1)
plot(chla_phe, pch=16, which=2)
par(mfrow=c(1,1))
The are somewhat normal and distributed along the 0 line although improvements may need to be done, such as investigation potential outliers (57, 45, and 3).
# plot predicted vs actual
plot(predict(chla_phe),wq_model$`Chlorophyll a`,
xlab="predicted",ylab="actual", main = "Chl-a ~ Pheophytin a")
abline(a=0,b=1)
# plot Chl-a vs Pheophytin a actual values with prediction line from model
ggplot(wq_model, aes(x = `Pheophytin a`, y = `Chlorophyll a`)) +
geom_point() +
geom_line(aes(y = predict(chla_phe)), shape = 1)
The predicted values from the model are somewhat predicting the trend of the actual values.
# add seasons
wq_meanmonthly_season <- wq_meanmonthly
wq_meanmonthly_season$Season <- ifelse(wq_meanmonthly_season$Month > 9 |
wq_meanmonthly_season$Month < 5, "wet season", "dry season")
wq_season <- wq_meanmonthly_season %>%
ungroup() %>%
select(Season, everything()) %>%
select(Month, everything()) %>%
select(Year, everything())
wq_season$Season <- as.factor(wq_season$Season)
# create new dataset for parallel regression model
wq_prl_model <- select(wq_season,
`Chlorophyll a`,
Season,
`Pheophytin a`)
# perform lms
# Chl-a ~ season
chla_season <- lm(`Chlorophyll a` ~ Season, data = wq_prl_model)
# Chl-a ~ Pheophytin a
chla_phea <- lm(`Chlorophyll a` ~ `Pheophytin a`, data = wq_prl_model)
# Chl-a ~ season + pheophytin a
chla_season.phea <- lm(`Chlorophyll a` ~ Season + `Pheophytin a`, data = wq_prl_model)
# compare models
lms_prl <- list(season=chla_season, phea=chla_phea, season.phea=chla_season.phea)
lms_prl.stats <- mapply(glance, lms_prl)
colnames(lms_prl.stats) <- names(lms_prl)
lms_prl.stats
season phea season.phea
r.squared 0.2360985 0.3558155 0.4413055
adj.r.squared 0.2279719 0.3489624 0.4292905
sigma 2.024828 1.859407 1.740921
statistic 29.05251 51.92092 36.72974
p.value 5.21002e-07 1.426426e-10 1.752311e-12
df 2 2 3
logLik -202.9341 -194.7523 -187.9179
AIC 411.8681 395.5046 383.8359
BIC 419.5612 403.1976 394.0933
deviance 385.3932 324.9952 281.865
df.residual 94 94 93
The addition of the season category does seem to improve the model. The parallel model is able to explain 43% of the variation while the model with only Pheophytin a as the variable is only able to explain 35% of the variation.
# plot residuals
par(mfrow=c(2,3))
plot(chla_season, pch=16, which=1)
plot(chla_phea, pch=16, which=1)
plot(chla_season.phea, pch=16, which=1)
plot(chla_season, pch=16, which=2)
plot(chla_phea, pch=16, which=2)
plot(chla_season.phea, pch=16, which=2)
par(mfrow=c(1,1))
Left: Chl-a ~ season; Middle: Chl-a ~ Pheophytin a; Right: Parallel model.
# extract intercept and slope values to generate regression lines
season_coef <- data.frame(t(coef(chla_season)))
colnames(season_coef) <- c("intercept", "slope")
season_coef
phea_coef <- data.frame(t(coef(chla_phea)))
colnames(phea_coef) <- c("intercept", "slope")
phea_coef
season.phea_coef <- data.frame(t(coef(chla_season.phea)))
colnames(season.phea_coef) <- c("intercept", "slope.season", "slope.phe")
season.phea_coef
# plot Chl-a vs Pheophytin a with regression lines
plot(wq_prl_model$`Pheophytin a`, wq_prl_model$`Chlorophyll a`,
xlab = "Pheophytin a",
ylab = "Chlorophyll a",
col = wq_prl_model$Season,
pch = 16,
main = "By Season (dry season: black, dry season: red)")
abline(reg = chla_phea, lty = 2)
abline(a = season.phea_coef$intercept, b = season.phea_coef$slope.phe, col = "blue")
The blue line is the regression line from the parallel model while the dashed black line is the regression line for the Chl-a ~ Pheophytin a model. Although not entirely clear from the plot, it seems as though the parallel model is accounting for for the dry season points compared to the univariate model. This may be because of the influence of the season variable in the parallel model. The plot also shows how the dry season may have a higher mean value than the wet season as it’s values seem to be higher than values from the wet season.